# Class 08: Supervised Machine Learning - Classification




In [76]:
import statistics
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
from urllib.request import urlopen

import matplotlib.pyplot as plt
%matplotlib inline

## 1. Machine Learning:  Features (X) and labels (y)

In supervised machine learning, we use a computer algorithm called a "pattern classifier" to learn the relationship between a set of features X and a label y. When the classifier is then given the features X of new examples, it can predict their labels y.


In [77]:
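# load the penguins data set that ships with seaborn
penguins = sns.load_dataset("penguins")

# drop rows that have missing values
penguins = penguins.dropna()

# shuffle the rows (frac = 1 samples all rows, in random order)
penguins = penguins.sample(frac = 1)

penguins.head()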

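|     | species   | island    | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex    |
|-----|-----------|-----------|----------------|---------------|-------------------|-------------|--------|
| 17  | Adelie    | Torgersen | 42.5           | 20.7          | 197.0             | 4500.0      | Male   |
| 52  | Adelie    | Biscoe    | 35.0           | 17.9          | 190.0             | 3450.0      | Female |
| 342 | Gentoo    | Biscoe    | 45.2           | 14.8          | 212.0             | 5200.0      | Female |
| 211 | Chinstrap | Dream     | 45.6           | 19.4          | 194.0             | 3525.0      | Female |
| 130 | Adelie    | Torgersen | 38.5           | 17.9          | 190.0             | 3325.0      | Female |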


In [78]:
# Let's count how many penguins of each species there are in our data set

species_count = penguins.groupby("species").agg(counts = ("species", "count"))
species_count



| species   | counts |
|-----------|--------|
| Adelie    | 146    |
| Chinstrap | 68     |
| Gentoo    | 119    |


#### Questions: 

1. If we had to guess the species of the penguin without knowing any of the penguin's features, which species of penguin should we guess? 

Answer: Always guess Adelie


2. If we were to follow the optimal guessing strategy, what percent of our guesses would be correct (i.e., what would our classification accuracy be)?

Answer: 43.84% (146 of the 333 penguins are Adelie)
   


In [79]:
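# get the proportion of penguins that belong to each species
# (divide by the total number of penguins, not by the number of species)

species_prop = species_count / species_count["counts"].sum()
species_prop 



| species   | counts   |
|-----------|----------|
| Adelie    | 0.438438 |
| Chinstrap | 0.204204 |
| Gentoo    | 0.357357 |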
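
The "always guess the most common species" strategy can also be written as a classifier, which gives us a baseline to compare later results against. The following is a minimal sketch, assuming scikit-learn's `DummyClassifier` is an acceptable stand-in for guessing by hand. The names `labels`, `placeholder_features`, and `baseline` are made up for this illustration, and the labels are rebuilt from the species counts (146, 68, 119) rather than loaded from the data set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Rebuild a label column from the species counts
# (146 Adelie, 68 Chinstrap, 119 Gentoo); the row order does not matter here.
labels = np.array(["Adelie"] * 146 + ["Chinstrap"] * 68 + ["Gentoo"] * 119)

# The baseline ignores the features, so a placeholder column of zeros is enough.
placeholder_features = np.zeros((len(labels), 1))

# strategy="most_frequent" always predicts the most common training label.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(placeholder_features, labels)

baseline_accuracy = baseline.score(placeholder_features, labels)
print(baseline.predict(placeholder_features[:1]))  # ['Adelie']
print(round(baseline_accuracy, 4))                 # 0.4384, i.e. 146/333
```

Any classifier that uses the penguin's features should beat this baseline accuracy to be considered useful.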


To begin the classification process, let's store the features (X) and the labels (y) in separate names called `X_penguin_features` and `y_penguin_labels` respectively. 

In [80]:
# get the features and the labels

X_penguin_features = penguins[['bill_length_mm', 
                               'bill_depth_mm',
                               'flipper_length_mm', 
                               'body_mass_g']]

y_penguin_labels = penguins['species']




## 2. k-Nearest Neighbors classifier

To explore classification, let's use a k-Nearest Neighbors classifier to predict the species of a penguin based on particular features the penguin has such as the penguin's bill length and body mass. 

Let's construct a k-Nearest Neighbors (KNN) classifier that uses 5 neighbors for its predictions (i.e., k = 5, so we are using a 5-Nearest Neighbors classifier). 

We can do this using the `KNeighborsClassifier(n_neighbors = )` constructor.  



In [81]:
from sklearn.neighbors import KNeighborsClassifier

# Construct a 5-nearest neighbor classifier
knn = KNeighborsClassifier(5)


Let's now train the classifier. (For a KNN classifier, training just stores the data.)


In [82]:
# “train” the classifier (which for a KNN classifier just involves memorizing the training data)

knn.fit(X_penguin_features, y_penguin_labels)


Let's now use the classifier to make predictions

In [83]:
# make predictions

predictions = knn.predict(X_penguin_features)
predictions[0:5]


array(['Adelie', 'Adelie', 'Gentoo', 'Chinstrap', 'Adelie'], dtype=object)
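
To make the idea behind these predictions concrete, here is a from-scratch sketch of what a KNN prediction involves: measure the distance from the new example to every training example, take the k closest, and let them vote. This is an illustration only, not scikit-learn's actual implementation (which uses faster neighbor searches and may break ties differently). The function `knn_predict_one` and the toy arrays are hypothetical names, and the sketch assumes NumPy arrays and Euclidean distance.

```python
from collections import Counter
import numpy as np

def knn_predict_one(X_train, y_train, x_new, k=5):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training example
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those k neighbors
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Tiny made-up data set: two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array(["A", "A", "A", "B", "B", "B"])

toy_prediction = knn_predict_one(X_toy, y_toy, np.array([1.1, 1.0]), k=3)
print(toy_prediction)  # A
```
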

Let's get the prediction accuracy (classification accuracy), which is the proportion of predictions that are correct.

In [84]:
# get the classification accuracy

accuracy = np.mean(predictions == y_penguin_labels)
accuracy


np.float64(0.8378378378378378)

Let's repeat our analysis with k = 1 to see what happens...

In [85]:
# What happens if k = 1?


# construct a classifier
knn = KNeighborsClassifier(1)


# “train” the classifier (which for a KNN classifier just involves memorizing the training data)
knn.fit(X_penguin_features, y_penguin_labels)


# make predictions
predictions = knn.predict(X_penguin_features)



# get classification accuracy
accuracy = np.mean(predictions == y_penguin_labels)
accuracy


np.float64(1.0)

Do we believe we have a perfect classifier???
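
One way to see why this result should make us suspicious: with k = 1, every training example is its own nearest neighbor (at distance 0), so the classifier just returns the label it memorized. The sketch below uses made-up random data whose labels have nothing to do with the features; the names `X_random`, `y_random`, and `knn_1` are hypothetical. It assumes no two rows have identical features, which holds here because the features are continuous random numbers.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Made-up data: random features, and labels that are unrelated to them
X_random = rng.normal(size=(200, 4))
y_random = rng.choice(["A", "B", "C"], size=200)

knn_1 = KNeighborsClassifier(n_neighbors=1)
knn_1.fit(X_random, y_random)

# Each training point is its own nearest neighbor,
# so the classifier simply returns the label it stored.
train_accuracy = knn_1.score(X_random, y_random)
print(train_accuracy)  # 1.0

# On fresh points generated the same way there is nothing to learn,
# so accuracy should be near chance (about 1/3).
X_fresh = rng.normal(size=(200, 4))
y_fresh = rng.choice(["A", "B", "C"], size=200)
fresh_accuracy = knn_1.score(X_fresh, y_fresh)
print(fresh_accuracy)
```

A perfect score on the data the classifier was trained on therefore tells us very little about how it will do on new penguins.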


## 3. Cross-validation

To check whether a classifier is over-fitting, and to get an honest estimate of how well it will do on new data, we need to split our data into a training set and a test set. 

The classifier "learns" the relationship between features (X) and labels (y) on the **training set**.

The classifier makes predictions on the features (X) of the **test set**. 

We compare the classifier's predictions on the test features (X) to the actual labels (y) of the test set to get a more accurate assessment of the **classification accuracy**.


Let's try this now...



In [86]:
# manually create a training set with 250 examples, and a test set that has the rest of the data

X_train_manual = X_penguin_features[0:250]
y_train_manual = y_penguin_labels[0:250]


X_test_manual = X_penguin_features[250:]
y_test_manual = y_penguin_labels[250:]


print(X_train_manual.shape)
print(X_test_manual.shape)


(250, 4)
(83, 4)


In [87]:
from sklearn.model_selection import train_test_split

# split data into a training and test set

X_train, X_test, y_train, y_test = train_test_split(X_penguin_features,  
                                                    y_penguin_labels, 
                                                    random_state = 0)

print(X_train.shape)
print(X_test.shape)

X_train.head(3)






(249, 4)
(84, 4)


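|     | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|-----|----------------|---------------|-------------------|-------------|
| 329 | 48.1           | 15.1          | 209.0             | 5500.0      |
| 261 | 49.6           | 16.0          | 225.0             | 5700.0      |
| 244 | 42.9           | 13.1          | 215.0             | 5000.0      |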
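
The 249/84 split above comes from `train_test_split()`'s default of holding out 25% of the rows for testing. The sketch below spells out that default and adds the optional `stratify` argument, which keeps the species proportions similar in the training and test sets. The arrays `X_demo` and `y_demo` are stand-in data with the same number of rows and the same species counts as the penguin table, not the real measurements.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same size and class counts as the penguin table
X_demo = np.arange(333 * 4).reshape(333, 4)
y_demo = np.array(["Adelie"] * 146 + ["Chinstrap"] * 68 + ["Gentoo"] * 119)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.25,      # same as the default when test_size is left out
    stratify=y_demo,     # keep class proportions similar in both parts
    random_state=0,      # fixed seed so the split is reproducible
)

print(X_tr.shape, X_te.shape)  # (249, 4) (84, 4)

# Proportion of Adelie penguins in the test set (close to 146/333)
adelie_test_prop = (y_te == "Adelie").mean()
print(adelie_test_prop)
```
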


In [88]:
from sklearn.neighbors import KNeighborsClassifier


# construct a classifier
knn = KNeighborsClassifier(1)


# “train” the classifier (which for a KNN classifier just involves memorizing the training data)
knn.fit(X_train, y_train)



In [89]:
# get the predictions

predictions = knn.predict(X_test)




In [90]:
# Get the prediction accuracy 

accuracy = np.mean(predictions == y_test)
accuracy 

np.float64(0.8571428571428571)

In [91]:
# Test the classifier on the test set using the .score() method

# prediction accuracy on the test set
knn.score(X_test, y_test)




0.8571428571428571

In [92]:
# What happens if we test the classifier on the training set? 

knn.score(X_train, y_train)



1.0

### K-fold cross-validation

In k-fold cross-validation we split our data into k parts (note, the k here has no relation to the k in k-Nearest Neighbors; k is simply a letter that is commonly used in math to denote an integer).  

To run a k-fold cross-validation analysis, we train the classifier on k-1 parts of the data and test it on the remaining part. We repeat this process k times to get k classification accuracies. We then take the average of these results as our estimate of our overall classification accuracy. 

We can use the scikit-learn `cross_val_score()` function to easily do this...


In [93]:
from sklearn.model_selection import cross_val_score


# construct knn classifier
knn = KNeighborsClassifier(1)


# do 5-fold cross-validation
scores = cross_val_score(knn, X_penguin_features,  y_penguin_labels, cv = 5)

print(scores)

print(scores.mean())





[0.86567164 0.82089552 0.85074627 0.81818182 0.8030303 ]
0.8317051108095883
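
To connect these numbers back to the description of k-fold cross-validation, here is a sketch of the same procedure written as an explicit loop: train on k-1 parts, test on the remaining part, and repeat k times. It relies on scikit-learn's documented behavior that, for a classifier and an integer `cv`, `cross_val_score()` uses stratified folds without shuffling, which is why the loop uses `StratifiedKFold`. The data is made up (three shifted clusters), and the names `X_cv`, `y_cv`, and `knn_cv` are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Made-up data: three classes whose features are shifted apart
y_cv = np.repeat(["A", "B", "C"], 50)
class_shift = np.repeat([0.0, 2.0, 4.0], 50)
X_cv = rng.normal(size=(150, 4)) + class_shift[:, None]

knn_cv = KNeighborsClassifier(n_neighbors=1)
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in folds.split(X_cv, y_cv):
    model = clone(knn_cv)                         # fresh, unfitted copy for each fold
    model.fit(X_cv[train_idx], y_cv[train_idx])   # train on k-1 parts
    # test on the held-out part and record the accuracy
    manual_scores.append(model.score(X_cv[test_idx], y_cv[test_idx]))

manual_scores = np.array(manual_scores)
print(manual_scores)
print(manual_scores.mean())

# Compare against cross_val_score on the same data
matches = np.allclose(manual_scores, cross_val_score(knn_cv, X_cv, y_cv, cv=5))
print(matches)
```

Every example ends up in the test part exactly once, so each of the five accuracies is computed on data the classifier did not see during training.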
